4 - Residual Networks and ODENets [ID:60591]

Residual networks.

Here we have a simple schematic of a block in a vanilla neural network. There is an input vector x. It gets passed through a weight layer where it is multiplied with a weight matrix, it goes through an activation layer, perhaps it goes through another weight layer, and then it gives a final output G(x).

This can be reformulated as block t in a large neural network, where the weights θ_t and the input x_t give an output x_{t+1}, with x_{t+1} = G(x_t, θ_t). Here you're putting a value in and you're getting a value out.
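As a minimal sketch of such a plain block, assuming a PyTorch-style implementation (the layer names and sizes here are illustrative, not taken from the lecture):

import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    """x_{t+1} = G(x_t, theta_t): the input is mapped straight to the output."""
    def __init__(self, dim):
        super().__init__()
        self.weight_layer_1 = nn.Linear(dim, dim)   # multiply with a weight matrix
        self.weight_layer_2 = nn.Linear(dim, dim)   # perhaps another weight layer

    def forward(self, x):
        h = torch.relu(self.weight_layer_1(x))      # activation layer
        return self.weight_layer_2(h)               # final output G(x)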

This is a ResNet block. Here it's a little bit different. There is an input vector x. It gets passed to a weight layer where it's multiplied with a weight matrix, it goes through an activation layer, and then perhaps through another weight layer. Finally, the input x is added to the final output to give G(x) + x.

So reformulating in the same way for block t, you get the equation x_{t+1} = x_t + G(x_t, θ_t). So similar to before, you're just putting a value in and you're getting a value out.
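Sketching the same thing for the residual block, under the same PyTorch-style assumptions as above, the only change is the skip connection that adds the input back onto the transformed output:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """x_{t+1} = x_t + G(x_t, theta_t): the input is added back to the output."""
    def __init__(self, dim):
        super().__init__()
        self.weight_layer_1 = nn.Linear(dim, dim)   # multiply with a weight matrix
        self.weight_layer_2 = nn.Linear(dim, dim)   # perhaps another weight layer

    def forward(self, x):
        h = torch.relu(self.weight_layer_1(x))      # activation layer
        return x + self.weight_layer_2(h)           # skip connection: G(x) + x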

But there is a reason why ResNets were far more successful when they came out.

Firstly, skip connections help information flow through the network by sending the hidden state x_t along, together with the transformation through the layer, G(x_t), to get x_{t+1}, preventing important information from being lost. This helped stabilize training, as in the beginning only the skip connections were sending information through while the weight layers were still being optimized.

The ResNet block allows for stacking, which helps in forming very, very deep networks.

And this is because of the way backpropagation works. To calculate how the loss function depends on the weights, dL/dθ, we repeatedly apply the chain rule to our intermediate gradients, multiplying them along the way. These multiplications can lead to vanishing or exploding gradients, which simply means gradients approaching zero or infinity. Gradient descent relies on these gradients to move towards a minimum, so a zero or an infinite gradient is really not going to help it.
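Written out as a sketch, using the notation from above and taking x_T as the final hidden state that enters the loss L, the repeated chain rule reads

\[
\frac{\partial L}{\partial \theta_t}
  = \frac{\partial L}{\partial x_T}
    \left( \prod_{k=t+1}^{T-1} \frac{\partial x_{k+1}}{\partial x_k} \right)
    \frac{\partial x_{t+1}}{\partial \theta_t}.
\]

For a residual block, x_{k+1} = x_k + G(x_k, θ_k), so each factor in that product becomes ∂x_{k+1}/∂x_k = I + ∂G(x_k, θ_k)/∂x_k. The identity term gives the gradient a direct path back through the skip connections, which is one standard way of seeing why the repeated multiplication is less likely to drive the gradient to zero.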

So circling back to what happens in a resonant, you pass X zero to G

and theta zero, you pass X zero and theta zero to G and add X zero to get X one.

And then you go

through all the iterations of how many of our resonant blocks you have.

In the end, you get


A critical comparison between Residual Networks and ODENets